PSTAT 100 Course Project
Andrew Hansen & William Markevitch

Data Description¶

For this project, we will be using 2 datasets provided by this course: the first being the ClimateWatch historical emissions data, and the second being from merging several life expectancy smaller datasets, mostly collected from World Bank Open Data. For the Life dataset, the result is essentially a convenience sample, but descriptive analyses without inference are nonetheless interesting and suggestive. Meanwhile, the ClimateWatch dataset reflects a representative sample of the world, as it is analyzed as a subset of a population that seeks to accurately reflect the characteristics of the world.

In regards to the emissions, we delve into the historical greenhouse gas emissions dataset sourced from Climate Watch. This dataset serves as a comprehensive record of global emissions, capturing valuable insights into the evolving landscape of human-induced climate impact over time. The timeframe of the dataset spans historical periods, allowing for an exploration of trends and patterns in emissions over time. In the long run, this comprehensive historical context allows researchers, policymakers, and the public to discuss the evolution of emissions, identify critical patterns, and assess the effectiveness of curing strategies. Researchers can leverage this dataset to analyze and communicate the urgency of addressing climate change, advocating for sustainable practices and policies. The emissions dataset contains 195 observations of countries, along with their Green House Gas emission over the years 1990 to 2019. The data brings together dozens of datasets to let users analyze and compare the Nationally Determined Contributions (NDCs) and access historical emissions data.

The Life dataset originates from The Data World Bank, from the United Nations Population Division, World Population Prospects: 2022 Revision, or derived from male and female life expectancy at birth from sources, such as Census reports and other statistical publications from national statistical offices, and Eurostat Demographic Statistics, United Nations Statistical Division, Population and Vital Statistics Report, U.S. Census Bureau: International Database, and Secretariat of the Pacific Community: Statistics and Demography. These organizations provided us with a historical in-depth dataset from the years 2000, 2010, 2015, and 2019 of 158 countries, covering the total life expectancy of their population of that year, the male and female life expectancy, GDP per capita of countries, region, sub-region, and population. The dataset is a comprehensive resource for exploring the dynamics of global population health, economic development, and demographic trends. By including data on total life expectancy, male and female life expectancy, GDP per capita, and additional socio-economic factors, this dataset offers an in-depth understanding of the intersections between health outcomes and economic factors across 158 countries. Analysts can leverage this dataset to identify patterns, disparities, and potential correlations between life expectancy, and policymakers can inspect economic well-being and demographic variables to make essential decisions on the country's future.

Question of Interest¶

While examining the datasets, we noticed an overlap in the years of observations described and collected. This prompted our interest in exploring potential patterns between the two sets. These datasets were collected independently, offering an opportunity to investigate an unintended aspect. Upon closer inspection, we formulated a question: Does the evident increase in Greenhouse Gas Emissions over the past two decades correlate with the life expectancy of people worldwide? If so, can we reasonably conclude that one is causing the other, disregarding other factors? If not, are there other explanations, like a rise in GDP or population size? We emphasized not inferring causality from our datasets, focusing on understanding correlations and interactions among the variables involved.

First we need to install (or update) all necessary libraries for this project.

In [76]:
%%capture # If the libraries don't properly download, remove this %%capture

!pip install numpy
!pip install pandas
!pip install altair
!pip install geopandas
!pip install "vegafusion[embed]>=1.4.0"

Now we import all these libraries and use "alt.data_transformers.enable("vegafusion")" to enable plotting of large data sets.

In [1]:
import numpy as np
import pandas as pd
import altair as alt
import statsmodels.api as sm
import geopandas as gpd

alt.data_transformers.enable("vegafusion")
Out[1]:
DataTransformerRegistry.enable('vegafusion')

Data analysis¶

Our first objective was to break down the datasets one at a time, check for correlation, patterns in their respective data and draw assumptions from them to then merge them. This meant first exploring the datasets one-by-one:

Exploration of the emissions dataset¶

In this portion of the project, we introduce our emissions dataset and use it to investigate trends of emissions over time

In [2]:
emissions = pd.read_csv('data/historical_emissions/historical_emissions.csv')
emissions.head()
Out[2]:
Country Data source Sector Gas Unit 2019 2018 2017 2016 2015 ... 1999 1998 1997 1996 1995 1994 1993 1992 1991 1990
0 World Climate Watch Total including LUCF All GHG MtCO₂e 49758.23 49368.04 48251.88 47531.68 46871.77 ... 35101.90 35099.21 35537.18 34179.33 33805.61 33015.04 32729.06 32588.09 32670.51 32523.58
1 China Climate Watch Total including LUCF All GHG MtCO₂e 12055.41 11821.66 11385.48 11151.31 11108.86 ... 4028.58 4095.97 3977.65 3982.11 3960.71 3557.37 3397.80 3168.05 3039.15 2891.73
2 United States Climate Watch Total including LUCF All GHG MtCO₂e 5771.00 5892.37 5689.61 5743.85 5665.21 ... 6210.12 6208.83 6160.86 5901.00 5729.69 5661.57 5567.55 5456.12 5372.08 5417.32
3 India Climate Watch Total including LUCF All GHG MtCO₂e 3363.60 3360.56 3215.07 3076.48 3003.07 ... 1440.38 1362.33 1331.88 1272.74 1223.65 1158.48 1114.22 1081.28 1056.25 1002.56
4 European Union (27) Climate Watch Total including LUCF All GHG MtCO₂e 3149.57 3295.53 3379.38 3364.77 3019.49 ... 3874.40 3949.25 3983.29 4058.46 3947.64 3894.63 3908.95 3981.08 4120.94 4187.90

5 rows × 35 columns

We first melt the emissions dataframe from having 20 columns of years into just one singular column called 'Year' so that we can use it in our analysis.

In [3]:
# Melt the dataframe so that the years is in one column only
emissions_melted_year = emissions.melt(id_vars=['Country', 'Data source', 'Sector', 'Gas', 'Unit'], var_name='Year', value_name='Emissions')
emissions_melted_year.head()
Out[3]:
Country Data source Sector Gas Unit Year Emissions
0 World Climate Watch Total including LUCF All GHG MtCO₂e 2019 49758.23
1 China Climate Watch Total including LUCF All GHG MtCO₂e 2019 12055.41
2 United States Climate Watch Total including LUCF All GHG MtCO₂e 2019 5771.00
3 India Climate Watch Total including LUCF All GHG MtCO₂e 2019 3363.60
4 European Union (27) Climate Watch Total including LUCF All GHG MtCO₂e 2019 3149.57

To get an understanding of the emission trends occuring in our data, we graph the emissions against the years for every country in our set.

In [5]:
# Plot all emissions of all years
emission_plot = alt.Chart(emissions_melted_year).mark_line().encode(
    x=alt.X('Year:N', title='Year'),
    y=alt.Y('Emissions:Q', title='Emissions'),
    color='Country:N',
    tooltip=['Country', 'Year', 'Emissions']
).properties(
    title='Emissions Over Time for Each Country',
    width=500,
    height=300
)

emission_plot
Out[5]:

In the displayed graph, we identified two outliers in our dataset, namely 'World' and 'European Union.' These entries encompass a larger collection of nations, introducing inconsistency compared to individual countries. To ensure data consistency, we opted to exclude both 'World' and 'European Union' from our melted dataset

In [7]:
# Remove 'World' and 'European Union' from dataset
world_col = emissions_melted_year['Country'] == 'World'
eu_col = emissions_melted_year['Country'] == "European Union (27)"
emissions_melted_year_without_world = emissions_melted_year[~world_col]
emissions_melted_year_countries_only = emissions_melted_year_without_world[~eu_col]
/tmp/ipykernel_705/586745506.py:5: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  emissions_melted_year_countries_only = emissions_melted_year_without_world[~eu_col]

Another way to visualize the increasing trend in emissions is to explore a graph illustrating the total emissions grouped by years, as shown below. This representation is akin to presenting the integral of our world line from the previous graph in bar chart form.

In [8]:
# Emissions over time
emissions_by_year = alt.Chart(emissions_melted_year_countries_only).mark_bar().encode(
    x=alt.X('Year:N', title='Year'),
    y=alt.Y('Emissions:Q', title='Emissions', scale=alt.Scale(zero=False)),
).properties(
    title='Total Emissions Over the Years',
    width=500,
    height=300
)

emissions_by_year
Out[8]:

For enhanced data visualization, we chose to generate a heatmap overlay on the world map, utilizing the geopandas library and focusing on the countries included in our exclusive dataset of individual countries.

In [9]:
# Calculate mean emissions 
country_means = emissions_melted_year_countries_only.groupby("Country")[["Emissions"]].mean().reset_index()  

# Join with geo dataframe
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
map_data = world.merge(country_means, left_on="name", right_on="Country", how="left")

# Create Altair choropleth
choropleth = alt.Chart(map_data).mark_geoshape().encode(
    color=alt.Color('Emissions:Q', scale=alt.Scale(scheme='greens')),
    tooltip=['Country:N', 'Emissions:Q']
).project(
    type='equirectangular'
).properties(
    title = 'Mean Emissions',
    width=500,
    height=300
)

choropleth.configure_legend(
    title = None
)

choropleth
/tmp/ipykernel_705/1801656095.py:5: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
  world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
/opt/conda/lib/python3.11/site-packages/altair/utils/core.py:410: FutureWarning: the convert_dtype parameter is deprecated and will be removed in a future version.  Do ``ser.astype(object).apply()`` instead if you want ``convert_dtype=False``.
  col = df[col_name].apply(to_list_if_array, convert_dtype=False)
/opt/conda/lib/python3.11/site-packages/altair/utils/core.py:410: FutureWarning: the convert_dtype parameter is deprecated and will be removed in a future version.  Do ``ser.astype(object).apply()`` instead if you want ``convert_dtype=False``.
  col = df[col_name].apply(to_list_if_array, convert_dtype=False)
Out[9]:

The map highlights the absence of certain countries from the heatmap. To address this discrepancy, we will conduct a comparison between the merged datasets, specifically examining countries that differ between the two datasets. This discrepancy arises as a result of our merging process, which was based on country names.

In [10]:
mismatches = set(world['name']) - set(country_means['Country'])
print("Mismatches between 'name' and 'Country':", mismatches)
Mismatches between 'name' and 'Country': {'eSwatini', 'United States of America', 'Antarctica', 'Puerto Rico', 'Eq. Guinea', 'Greenland', 'Kosovo', 'New Caledonia', 'Dem. Rep. Congo', 'Bosnia and Herz.', 'Congo', 'North Macedonia', 'Czechia', 'Central African Rep.', 'Fr. S. Antarctic Lands', 'W. Sahara', 'S. Sudan', 'Falkland Is.', 'Somaliland', 'Turkey', 'N. Cyprus', 'Dominican Rep.', 'Solomon Is.', 'Taiwan', 'Palestine'}

Having identified the countries with discrepancies, we proceed to conduct a comprehensive analysis. This involves distinguishing between discrepancies arising from variations in spelling or punctuation and those resulting from countries not being included in our dataset but present in the world dataset sourced from geopandas. To rectify these issues, we utilize dictionaries and employ methods such as .replace() or filter out the necessary countries for removal.

In [11]:
# Calculate mean emissions 
country_means = emissions_melted_year_countries_only.groupby("Country")[["Emissions"]].mean().reset_index()  

# Join with geo dataframe
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Manually standardizing names for Country mismatches

# Countries to change name
replace_dict = {
    'Czechia': 'Czech Republic',
    'Congo': 'Republic of Congo',
    'Bosnia and Herz.': 'Bosnia and Herzegovina',
    'eSwatini': 'Eswatini',
    'Somaliland': 'Somalia',
    'N. Cyprus': 'Cyprus',
    'S. Sudan': 'South Sudan',
    'Dem. Rep. Congo': 'Democratic Republic of the Congo',
    'Eq. Guinea': 'Equatorial Guinea',
    'Dominican Rep.': 'Dominican Republic',
    'Central African Rep.': 'Central African Republic',
    'Solomon Is.': 'Solomon Islands',
    'United States of America': 'United States'
}

# Replace
world['name'] = world['name'].replace(replace_dict)

# Countries to drop 
drop_countries = {
    'New Caledonia',
    'Turkey',
    'Puerto Rico',
    'W. Sahara',
    'Taiwan',
    'Falkland Is.',
    'Greenland',
    'Palestine',
    'Fr. S. Antarctic Lands',
    'Antarctica',
    'Kosovo',
    'North Macedonia'
}

# Filter
world = world[~world.name.isin(drop_countries)]

map_data = world.merge(country_means, left_on="name", right_on="Country", how="left")

# Create Altair choropleth
choropleth = alt.Chart(map_data).mark_geoshape().encode(
    color=alt.Color('Emissions:Q', scale=alt.Scale(scheme='greens')),
    tooltip=['Country:N', 'Emissions:Q']
).project(
    type='equirectangular'
).properties(
    title = 'Mean Emissions',
    width=500,
    height=300
)

choropleth.configure_legend(
    title = None
)

choropleth
/tmp/ipykernel_705/2537791032.py:5: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
  world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
/opt/conda/lib/python3.11/site-packages/altair/utils/core.py:410: FutureWarning: the convert_dtype parameter is deprecated and will be removed in a future version.  Do ``ser.astype(object).apply()`` instead if you want ``convert_dtype=False``.
  col = df[col_name].apply(to_list_if_array, convert_dtype=False)
/opt/conda/lib/python3.11/site-packages/altair/utils/core.py:410: FutureWarning: the convert_dtype parameter is deprecated and will be removed in a future version.  Do ``ser.astype(object).apply()`` instead if you want ``convert_dtype=False``.
  col = df[col_name].apply(to_list_if_array, convert_dtype=False)
Out[11]:

To ensure the map is complete, let's review the data below and confirm that there are no more differences between the two datasets.

In [12]:
mismatches_after = set(world['name']) - set(country_means['Country'])
print("Mismatches between 'name' and 'Country':", mismatches_after)
Mismatches between 'name' and 'Country': set()

Returning to our line plot showcasing total emissions over time for each country, we encountered a challenge. The majority of smaller emissions from various countries were clustered at the bottom of the graph. To address this, we opted to highlight the top 10 highest emitting countries while combining the rest into a single category labeled 'Others.' This approach enables us to create an aesthetically pleasing plot depicting the log-transformed emissions of the top 10 countries over the years. Furthermore, for added context, we can visualize the cumulative greenhouse gas emissions over the years for all countries, revealing a consistent upward trend with a notable pause around 1988.

In [38]:
# Keep all top 10 emitting countries and group the rest as 'Others'
top_10_countries = emissions_melted_year_countries_only.groupby('Country')['Emissions'].sum().nlargest(10).index
top10_em = emissions_melted_year
top10_em['Country'] = top10_em['Country'].apply(lambda x: x if x in top_10_countries else 'Other')
top10_em = top10_em.groupby(['Country','Year'])['Emissions'].mean().reset_index()

#Graph emissions over time
top10_emissions_plot = alt.Chart(top10_em).mark_line(point=True).encode(
    x=alt.X('Year:N', title='Year'),
    y=alt.Y('Emissions:Q',scale=alt.Scale(type='log'), title='Emissions (log)'),
    color='Country:N',
    tooltip=['Country', 'Year', 'Emissions']
).properties(
    title='Emissions Over Time for Top Emitting Countries',
    width=400,
    height=300
)

top10_emissions_plot
Out[38]:

Examining the total emissions across our time period for each of the top emitting countries, we can enhance our visualization by representing this information in a bar chart. This format allows for a clearer depiction of the significant drop-off in emissions between the highest contributors.

In [14]:
emissions_top_10_table = emissions_melted_year_countries_only.groupby('Country')['Emissions'].sum().nlargest(10).reset_index()

#Top 10 most Emitting countries
emission_top_countries = alt.Chart(emissions_top_10_table).mark_bar().encode(
    x=alt.X('Country:N' ,sort='-y', title=''),
    y=alt.Y('Emissions:Q', title='Emissions'),
).properties(
    title='Top 10 Emission Countries Over the Years',
    width=500,
    height=300
)
emission_top_countries
Out[14]:

Exploration of the life dataset¶

In this portion of the project, we introduce our life dataset and use it to investigate correlations population and life expectancy across regions and sub-regions

In [15]:
y = ['2019','2015','2010', 'Country','Data source','Sector','Unit']
emissions_intersect = emissions[emissions.columns.intersection(y)]
emissions_melted = pd.melt(emissions_intersect, id_vars=['Country', 'Data source', 'Sector', 'Unit'],
                            value_vars=['2010', '2015', '2019'],
                            var_name='Year', value_name='Emissions')


life = pd.read_csv('data/historical_emissions/life-gdp.csv')
life = life.rename(columns={"Country Name": "Country", 'region':"Region", 'All':'Life_expectancy'})
life.head()
Out[15]:
Country Year Life_expectancy Male Female GDP per capita Region sub-region Population
0 Afghanistan 2019 63.2 63.3 63.2 507.103432 Asia Southern Asia 38041754.0
1 Afghanistan 2015 61.7 61.0 62.3 578.466353 Asia Southern Asia 34413603.0
2 Afghanistan 2010 59.9 59.6 60.3 543.303042 Asia Southern Asia 29185507.0
3 Albania 2019 78.0 76.3 79.9 5353.244856 Europe Southern Europe 2854191.0
4 Albania 2015 77.8 76.1 79.7 3952.801215 Europe Southern Europe 2880703.0
In [16]:
life_by_population = life.groupby(['Region'])['Population'].sum().reset_index()

total_population = life_by_population['Population'].sum()

life_by_population['Proportion'] = (life_by_population['Population'] / total_population)
life_by_population.sort_values(by=['Proportion'], ascending=True)
life_by_population.head()
Out[16]:
Region Population Proportion
0 Africa 3.342706e+09 0.139592
1 Americas 2.334736e+09 0.097499
2 Asia 1.550806e+10 0.647620
3 Europe 2.614412e+09 0.109178
4 Oceania 1.463373e+08 0.006111

This bar chart displays populations of sub-regions on the x-axis, with color-coded groupings representing individual regions. This provides a comprehensive view of population distribution, enabling a clear understanding of both sub-regional and regional population concentrations.

In [17]:
pop_subregion = life.groupby(['sub-region', 'Region'])['Population'].sum().reset_index()
In [37]:
region_pop = alt.Chart(pop_subregion).mark_bar().encode(
    x=alt.X('sub-region:N',sort='-y', title = 'Sub-Region'),
    y=alt.Y('Population:Q'),
    color=alt.Color('Region:N'),
    tooltip=['sub-region', 'Population'], 
).properties(
    width=400,
    height=300,
    title='Population by Region and Sub-Region'
)

region_pop
Out[37]:

With a clearer understanding of the population distribution, let's now explore the life expectancy distribution within our dataset to gain insights into overall trends.

In [21]:
# Plot of all observations in the life df for life expentancy
life_expectancy = alt.Chart(life).mark_bar().encode(
    x = alt.X('Life_expectancy', 
              bin = alt.Bin(step = 2), 
              title = 'Life Expectancy at Birth'),
    y = alt.Y('count()', title = 'Counts')).properties(
    width=500,
    height=300
)

life_expectancy
Out[21]:

Next, we present the mean life expectancy for each sub-region, accompanied by a red line parallel to the x-axis, representing the overall mean life expectancy derived from our dataset.

In [23]:
mean_life_by_subregion = life.groupby('sub-region')['Life_expectancy'].mean().reset_index()

# Create a bar chart
chart = alt.Chart(mean_life_by_subregion).mark_bar().encode(
    x=alt.X('sub-region', sort = '-y', title = 'Sub-Region'),
    y='Life_expectancy:Q'
).properties(
    title='Mean Life Expectancy by Sub-Region',
    width=500,
    height=300
)

mean_line = alt.Chart(life).mark_rule(color='red').encode(
    y=alt.Y('mean(Life_expectancy):Q', axis=alt.Axis(title='Mean Life Expectancy'))
)

chart + mean_line
Out[23]:

The table below summarizes the worldwide average life expectancy for each recorded year, providing a glimpse of the corresponding percentage changes.

In [24]:
life_exp_by_region = life.groupby('Year').agg({'Life_expectancy':'mean'}).reset_index()

life_exp_by_region['Life Expectancy Percentage Change'] = (life_exp_by_region['Life_expectancy'].diff() / life_exp_by_region['Life_expectancy'].shift(1)) * 100
life_exp_by_region['Life Expectancy Percentage Change'] = life_exp_by_region['Life Expectancy Percentage Change'].map('{:.2f}%'.format)
life_exp_by_region.head()
Out[24]:
Year Life_expectancy Life Expectancy Percentage Change
0 2000 67.081169 nan%
1 2010 70.096815 4.50%
2 2015 71.756410 2.37%
3 2019 72.806536 1.46%

The subsequent table displays the global mean life expectancy for each recorded year, categorized by sex, allowing for an examination of potential differences between men and women throughout the same timeframe.

In [25]:
life_exp_by_gender = life.groupby('Year').agg({'Male':'mean','Female':'mean'}).reset_index()
life_exp_by_gender_melt = life_exp_by_gender.melt(id_vars='Year', value_vars=['Male', 'Female'], var_name='Gender', value_name='Life Expectancy')
life_exp_by_gender['Male_Percentage_Change'] = (life_exp_by_gender['Male'].diff() / life_exp_by_gender['Male'].shift(1)) * 100
life_exp_by_gender['Female_Percentage_Change'] = (life_exp_by_gender['Female'].diff() / life_exp_by_gender['Female'].shift(1)) * 100

# Add a '%' sign to the columns
life_exp_by_gender['Male_Percentage_Change'] = life_exp_by_gender['Male_Percentage_Change'].map('{:.2f}%'.format)
life_exp_by_gender['Female_Percentage_Change'] = life_exp_by_gender['Female_Percentage_Change'].map('{:.2f}%'.format)
life_exp_by_gender.head()
Out[25]:
Year Male Female Male_Percentage_Change Female_Percentage_Change
0 2000 64.625325 69.629221 nan% nan%
1 2010 67.682803 72.568153 4.73% 4.22%
2 2015 69.358974 74.204487 2.48% 2.25%
3 2019 70.436601 75.236601 1.55% 1.39%

Utilizing the recently created dataframe, we can visually explore the mean life expectancy for each of the four years with available data, categorized by sex, in a bar chart.

In [31]:
exp_gender_graph = alt.Chart(life_exp_by_gender_melt).mark_bar().encode(
    x=alt.X('Gender:N', title = ''),
    y=alt.Y('Life Expectancy:Q', title='Life Expectancy'), 
    color='Gender:N',
    column=alt.Column('Year:O', title='')
).properties(
    title='Mean Life Expectancy by Gender across Years',
    width=100,
    height=300
)

exp_gender_graph
Out[31]:

Here, we generate a scatter plot depicting the relationship between log-transformed population and life expectancy for each country. Additionally, we've organized the countries by region and applied distinct color-coding to provide a more comprehensive visual representation.

In [35]:
scatter_chart = alt.Chart(life).mark_circle().encode(
    x= alt.X('Life_expectancy:Q', title = 'Life Expectancy'),
    y = alt.Y('Population:Q', scale=alt.Scale(type='log'), title = 'Population (log)'),
    color='Region:N'
).properties(
    title='Population vs Life Expectancy by Region',
    width=400,
    height=300
)

scatter_chart
Out[35]:

Investigating correlations between life expectancy and GDP per capita, we depict the two variables in a plot, incorporating color-coding by region and utilizing point size to represent population size. Additionally, regression lines are incorporated for each region, culminating in the following visual representation.

In [41]:
#scatter plot of relation of 'Life_expectancy' and GDP per capita
scatter_gdp = alt.Chart(life).mark_circle().encode(
    x=alt.X('GDP per capita:Q', scale = alt.Scale(type = 'log'), title = 'GDP per capita (log)'),
    y=alt.Y('Life_expectancy:Q', title = 'Life Expectancy'),
    color = 'Region:N',
    size = alt.Size('Population',scale = alt.Scale(type = 'sqrt'))
).properties(
    title='Life Expectancy vs GDP per Capita',
    width=400,
    height=400
)

loess_line = scatter_gdp.transform_loess(
    on = 'GDP per capita', 
    loess = 'Life_expectancy',
    bandwidth = 0.15
).mark_line(color='black')

reg_line = scatter_gdp.transform_regression(
    groupby = ['Region'],
    on = 'GDP per capita', 
    regression = 'Life_expectancy'
).mark_line(color = 'pink')

(scatter_gdp + reg_line).properties(width=400, height=400)
Out[41]:

The presented table displays total GDP per capita sorted in descending order by region, providing insights into regions with the highest GDP per capita within our country pool in the life dataset.

In [42]:
region_gdp = life.groupby('Region')['GDP per capita'].sum().reset_index()
region_gdp.sort_values(by=['GDP per capita'], ascending=False)
Out[42]:
Region GDP per capita
3 Europe 3.716451e+06
2 Asia 1.772185e+06
1 Americas 8.945901e+05
0 Africa 4.276947e+05
4 Oceania 3.903300e+05

Making a slight adjustment to the previous table, the current one presents the mean GDP per capita sorted in descending order by region. Furthermore, we have integrated the life expectancy of the regions, allowing for the observation of any disparities between the mean GDP per capita and life expectancy in each region.

In [43]:
region_life_expectancy = life.groupby('Region').mean(numeric_only=True).drop(columns=['Year','Male','Female',
                                                                                     'Population']).reset_index()
region_life_expectancy.sort_values(by=['GDP per capita'], ascending=False)
Out[43]:
Region Life_expectancy GDP per capita
3 Europe 77.602857 26546.081746
2 Asia 72.394805 11507.692835
4 Oceania 69.105556 10842.500132
1 Americas 74.040187 8360.655397
0 Africa 61.452459 2337.129273

Merging and Joint Investigation of Emissions & Life¶

In this portion of the project, we merge our emissions and life datasets in order to investigate correlations between emissions and various other variables

Initially, we merge the two datasets and filter the observations so that only those with complete records for all four years in the life dataset are incorporated into the merged dataset with emissions data.

In [44]:
y = ['2019','2015','2010','2000', 'Country','Data source','Sector','Unit']
emissions1 = emissions[emissions.columns.intersection(y)]
emissions_melted = pd.melt(emissions1, id_vars=['Country', 'Data source', 'Sector', 'Unit'],
                            value_vars=['2000','2010', '2015', '2019'],
                            var_name='Year', value_name='Emissions')

merged_df = pd.merge(emissions_melted, life, on='Country', how='inner')
merged_df['Year_x'] = merged_df['Year_x'].astype(int)
merged_df['Year_y'] = merged_df['Year_y'].astype(int)
merged_df = merged_df[merged_df['Year_x']==merged_df['Year_y']]
merged_df = merged_df.drop(columns = 'Year_x')
merged_df=merged_df.rename(columns = {'Year_y':"Year"})
merged_df = merged_df.drop(columns = ['Data source', 'Sector', 'Unit'])
merged_df = merged_df.groupby('Country').filter(lambda x: all(year in x['Year'].values for year in [2000, 2010, 2015, 2019]))
merged_df.groupby(['Country','Year']).sum()
merged_df.head()
Out[44]:
Country Emissions Year Life_expectancy Male Female GDP per capita Region sub-region Population
3 China 4221.08 2000 71.6 69.3 74.2 959.372484 Asia Eastern Asia 1.262645e+09
6 China 9887.06 2010 74.9 72.3 77.9 4550.453108 Asia Eastern Asia 1.337705e+09
9 China 11108.86 2015 76.6 73.9 79.9 8066.942635 Asia Eastern Asia 1.371220e+09
12 China 12055.41 2019 77.4 74.7 80.5 10216.630330 Asia Eastern Asia 1.397715e+09
19 India 1477.87 2000 62.1 61.3 62.9 443.314193 Asia Southern Asia 1.056576e+09

Now, a variable representing population in millions can be generated for use in various plots.

In [45]:
xy = merged_df.groupby(['Country','Year'])['Emissions'].sum().reset_index()
xy['Emission_Increase'] = xy.groupby('Country')['Emissions'].diff()
xy.nlargest(10, 'Emission_Increase').reset_index()
merged_df['Population_in_millions'] = merged_df['Population'].astype(int)/1000000
merged_df.head()
Out[45]:
Country Emissions Year Life_expectancy Male Female GDP per capita Region sub-region Population Population_in_millions
3 China 4221.08 2000 71.6 69.3 74.2 959.372484 Asia Eastern Asia 1.262645e+09 1262.645000
6 China 9887.06 2010 74.9 72.3 77.9 4550.453108 Asia Eastern Asia 1.337705e+09 1337.705000
9 China 11108.86 2015 76.6 73.9 79.9 8066.942635 Asia Eastern Asia 1.371220e+09 1371.220000
12 China 12055.41 2019 77.4 74.7 80.5 10216.630330 Asia Eastern Asia 1.397715e+09 1397.715000
19 India 1477.87 2000 62.1 61.3 62.9 443.314193 Asia Southern Asia 1.056576e+09 1056.575549

Due to the disproportionate nature of greenhouse gas emissions worldwide, it can help us gain perspective by filtering out the top 10 countries with the highest mean emissions to view more comparable country emission values. Below we make a scatter plot with population of the non-top 10 emitting nations (in millions) versus mean emissions (square rooted).

In [48]:
mean_emissions = merged_df.groupby('Country')['Emissions'].mean()

# Get the top 10 countries with the highest mean emissions
top_10_countries = mean_emissions.nlargest(10).index

# Create a new table that removes the top 10 countries with the highest mean emissions
filtered_df = merged_df[~merged_df['Country'].isin(top_10_countries)]
filtered_df.head()

pop_in_millions_versus_mean_emissions_sqrt = alt.Chart(merged_df).mark_point().encode(
x = alt.X('Emissions:Q', scale = alt.Scale(type = 'sqrt', zero = False), title = 'Mean Emissions (Square Rooted)'),
y = alt.Y('Population_in_millions', title = 'Population (in millions)')
)

pop_smooth = pop_in_millions_versus_mean_emissions_sqrt.transform_loess('Emissions', 'Population_in_millions',
                             bandwidth = 0.25).mark_line(color = 'black')

# Combining the scatter plot and the smooth line
combined_graph = (pop_in_millions_versus_mean_emissions_sqrt + pop_smooth).properties(width=450, height=400)

combined_graph
Out[48]:
In [50]:
temp = merged_df[['Country','Year','Emissions', 'Life_expectancy']].reset_index()
temp = temp.drop(columns = 'index')
top_10_emitters = temp.groupby('Country').mean().nlargest(10, 'Emissions').reset_index()

Below is a straightforward scatter plot depicting the life expectancy versus log-Emissions (scaled for clarity) for the top 10 countries, with each country color-coded:

In [51]:
alt.Chart(top_10_emitters, title='Life Expectancy vs Emissions').mark_circle().encode(
    x=alt.X('Emissions', scale = alt.Scale(type ='log'), title = 'Emissions (log)'),
    y=alt.Y('Life_expectancy', title = 'Life Expectancy'), 
    color='Country:N').properties(width=400, height=400)
Out[51]:

Now we can adjust our DataFrame to make sure that no countries with negative emissions are listed, as this can't be logged for future charts. Therefore we can set all countries with negative emissions to 0.01 to work around this, and we don't lose much as there are so few countries in the negative and these negative magnitudes are statistically insignificant

In [52]:
# Adjusting to make sure that no countries with negative emissions are listed, as this can't be logged for future charts

merged_df_min_em = merged_df
merged_df_min_em['Emissions'] = merged_df_min_em['Emissions'].apply(lambda x: max(x, 0.01))
merged_df_min_em.head()
Out[52]:
Country Emissions Year Life_expectancy Male Female GDP per capita Region sub-region Population Population_in_millions
3 China 4221.08 2000 71.6 69.3 74.2 959.372484 Asia Eastern Asia 1.262645e+09 1262.645000
6 China 9887.06 2010 74.9 72.3 77.9 4550.453108 Asia Eastern Asia 1.337705e+09 1337.705000
9 China 11108.86 2015 76.6 73.9 79.9 8066.942635 Asia Eastern Asia 1.371220e+09 1371.220000
12 China 12055.41 2019 77.4 74.7 80.5 10216.630330 Asia Eastern Asia 1.397715e+09 1397.715000
19 India 1477.87 2000 62.1 61.3 62.9 443.314193 Asia Southern Asia 1.056576e+09 1056.575549

Next, we generate a scatter plot for countries outside the top 10 emitters, illustrating life expectancy against the square root of GDP per capita (scaled for clarity). Additionally, we overlay a LOESS regression line on the plot.

In [53]:
mean_emissions_min = merged_df_min_em.groupby('Country')['Emissions'].mean()

# Get the top 10 countries with the highest mean emissions
top_10_countries = mean_emissions_min.nlargest(20).index

# Create a new table that removes the top 10 countries with the highest mean emissions
filtered_df1 = merged_df_min_em[~merged_df['Country'].isin(top_10_countries)]
filtered_df1.head()

f_graph = alt.Chart(filtered_df1).mark_point().encode(
x = alt.X('GDP per capita:Q', scale = alt.Scale(type = 'sqrt')),
y = alt.Y('Life_expectancy',scale = alt.Scale(zero = False), title = 'Life Expectancy'))


f_smooth = f_graph.transform_loess('GDP per capita', 'Life_expectancy',
                                  bandwidth = 0.3).mark_line(color = 'black')

(f_graph + f_smooth).properties(width=450, height=400)
Out[53]:

In contrast, this scatter plot illustrates the life expectancy versus log emissions (scaled for clarity) for countries outside the top 10 emitters. The regions are color-coded, and the size of the points represents population size. Additionally, we include another LOESS regression line on the plot.

In [55]:
m = alt.Chart(merged_df_min_em).mark_circle(opacity = 0.5).encode(
    x = alt.X('Emissions', scale = alt.Scale(type = 'log', zero=False)),
    y = alt.Y('Life_expectancy', title = 'Life Expectancy', scale = alt.Scale(zero = False)),
    color = 'Region',
    size = alt.Size('Population', scale = alt.Scale(type = 'sqrt'))
)#.facet(column = 'Year')

m + m.transform_loess(
    on = 'Emissions', 
    loess = 'Life_expectancy',
    bandwidth = 0.4
).transform_calculate(
        Region='"LOESS line"'
).mark_line(color='black').properties(width=400, height=400)
Out[55]:
In [56]:
merged_mf = merged_df.drop(columns=['Country','Year','Male', 'Female','Region','sub-region','Population_in_millions'])

To explore the correlations between our variables, we can generate a heatmap-style correlation matrix, as depicted belo:w

In [57]:
corr_mx = merged_mf.corr()
# melt corr_mx
corr_mx_long = corr_mx.reset_index().rename(
    columns = {'index': 'row'}
).melt(
    id_vars = 'row',
    var_name = 'col',
    value_name = 'Correlation'
)

# construct plot
alt.Chart(corr_mx_long).mark_rect().encode(
    x = alt.X('col', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}), 
    y = alt.Y('row', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
    color = alt.Color('Correlation', 
                      scale = alt.Scale(scheme = 'blueorange', # diverging gradient
                                        domain = (-1, 1), # ensure white = 0
                                        type = 'sqrt'), # adjust gradient scale
                     legend = alt.Legend(tickCount = 5)) # add ticks to colorbar at 0.5 for reference
).properties(width = 300, height = 300)
Out[57]:

The visual display illustrates a positive correlation between GDP per capita and population, as well as between population and emissions, as emphasized in the above plot. Additionally, there are several other relationships, which we will delve into further later on.

Regression Analysis¶

To understand trends in emissions against our numerical predictors, we set up a linear regression model to analyze the trends and correlations between our variables. We first cut down our dataset into one we can use for our regression analysis, keeping only our numerical response variable: ‘Emissions’, and our numerical predictors: ‘Life_expectancy’, ‘Population’, and ‘GDP per capita’. From there, we create a Multiple Linear Regression Model containing all 3 of our predictors, then we cut it down to 2 predictors, considering every combination of predictors, and then create 3 separate simple regression models. With every model, we are able to extract the R^2 value, which captures how large or how little variance the model predicts of the response variable. The R^2 value is often used as a goodness-of-fit measure to assess how well the independent variables explain the variability in the dependent variable(s).

In [36]:
reg_data = merged_df.drop(columns=['Country','Year','Male', 'Female','Region','sub-region','Population_in_millions'])
reg_data = reg_data.rename(columns = {'GDP per capita':'GDP_per_capita'})
reg_data.head()
Out[36]:
Emissions Life_expectancy GDP_per_capita Population
3 4221.08 71.6 959.372484 1.262645e+09
6 9887.06 74.9 4550.453108 1.337705e+09
9 11108.86 76.6 8066.942635 1.371220e+09
12 12055.41 77.4 10216.630330 1.397715e+09
19 1477.87 62.1 443.314193 1.056576e+09
In [37]:
# Treat emissions as variable of interest
# MLR 1

# retrieve response
y = reg_data.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.loc[:, ['Life_expectancy', 'GDP_per_capita','Population']])

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()

coef_tbl = pd.DataFrame({
    'estimate': rslt.params.values,
    'standard error': np.sqrt(rslt.cov_params().values.diagonal()) 
    },
    index = x.columns 
)

coef_tbl.loc['error variance', 'estimate'] = rslt.scale

coef_tbl
Out[37]:
estimate standard error
const -402.176213 1.781174e+02
Life_expectancy 5.741325 2.647596e+00
GDP_per_capita 0.001480 1.352472e-03
Population 0.000005 1.249454e-07
error variance 211721.361782 NaN
In [38]:
rslt.rsquared
Out[38]:
0.7199793960178814
In [39]:
#Treat emissions as variable of interest
#MLR 2

# retrieve response
y = reg_data.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.loc[:, ['GDP_per_capita','Population']])

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()

coef_tbl = pd.DataFrame({
    'estimate': rslt.params.values,
    'standard error': np.sqrt(rslt.cov_params().values.diagonal()) 
    },
    index = x.columns 
)

coef_tbl.loc['error variance', 'estimate'] = rslt.scale

coef_tbl
rslt.rsquared
Out[39]:
0.7176933327761046
In [40]:
#Treat emissions as variable of interest
#MLR 3

# retrieve response
y = reg_data.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.loc[:, ['Life_expectancy','Population']])

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()

coef_tbl = pd.DataFrame({
    'estimate': rslt.params.values,
    'standard error': np.sqrt(rslt.cov_params().values.diagonal()) 
    },
    index = x.columns 
)

coef_tbl.loc['error variance', 'estimate'] = rslt.scale

coef_tbl
rslt.rsquared
Out[40]:
0.719397247629431
In [41]:
#Treat emissions as variable of interest
#MLR 4

# retrieve response
y = reg_data.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.loc[:, ['Life_expectancy','GDP_per_capita']])

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()

coef_tbl = pd.DataFrame({
    'estimate': rslt.params.values,
    'standard error': np.sqrt(rslt.cov_params().values.diagonal()) 
    },
    index = x.columns 
)

coef_tbl.loc['error variance', 'estimate'] = rslt.scale

coef_tbl
rslt.rsquared
Out[41]:
0.012482771328017228
In [42]:
#SLR (population)
# retrieve response
y = reg_data.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.Population)

# print first five rows of x
slr = sm.OLS(endog = y, exog = x)

rslt = slr.fit()
coef_tbl = pd.DataFrame({'estimate': rslt.params.values,
              'standard error': np.sqrt(rslt.cov_params().values.diagonal())},
              index = x.columns)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale

coef_tbl
print(coef_tbl)
print('-------')
print('R^2 = ',rslt.rsquared)
                     estimate  standard error
const               20.110843    1.991323e+01
Population           0.000005    1.255190e-07
error variance  215918.570765             NaN
-------
R^2 =  0.7134366397152028
In [43]:
#SLR (Life Expectancy)
# retrieve response
y = reg_data.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.Life_expectancy)

# print first five rows of x
slr = sm.OLS(endog = y, exog = x)

rslt = slr.fit()
coef_tbl = pd.DataFrame({'estimate': rslt.params.values,
              'standard error': np.sqrt(rslt.cov_params().values.diagonal())},
              index = x.columns)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale

coef_tbl
print(coef_tbl)
print('-------')
print('R^2 = ',rslt.rsquared)
                      estimate  standard error
const              -432.766747      284.093073
Life_expectancy       9.082499        4.001055
error variance   746817.775677             NaN
-------
R^2 =  0.008836476822486672
In [44]:
#SLR (GDP_per_capita)
# retrieve response
y = reg_data.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.GDP_per_capita)

# print first five rows of x
slr = sm.OLS(endog = y, exog = x)

rslt = slr.fit()
coef_tbl = pd.DataFrame({'estimate': rslt.params.values,
              'standard error': np.sqrt(rslt.cov_params().values.diagonal())},
              index = x.columns)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale

coef_tbl
print(coef_tbl)
print('-------')
print('R^2 = ',rslt.rsquared)
                     estimate  standard error
const              202.964997       43.575563
GDP_per_capita       0.000335        0.002048
error variance  753441.041101             NaN
-------
R^2 =  4.619449856890778e-05

After finding all R^2 values, we can compare them side by side in a neatly formatted table.

In [45]:
Model = ['MLR', 'MLR','MLR','MLR','SLR','SLR','SLR']
Num_predictors = [3, 2, 2, 2, 1, 1, 1, 1]
Predictor_names = ['Life expectancy + GDP per capita + Population',
                   'GDP_per_capita + Population',
                   'Life_expectancy + Population',
                   'Life_expectancy + GDP_per_capita',
                   'Population','Life expectancy','GDP per capita']
R_sqr_val = [0.7199793960178814, 
             0.7176933327761046, 0.719397247629431,0.012482771328017228,
            0.7134366397152028, 0.008836476822486672,4.619449856890778e-05 ]


data = {
    'Model': ['MLR', 'MLR', 'MLR', 'MLR', 'SLR', 'SLR', 'SLR'],
    'Num_predictors': [3, 2, 2, 2, 1, 1, 1],
    'Predictor_names': ['Life expectancy + GDP per capita + Population',
                        'GDP_per_capita + Population',
                        'Life_expectancy + Population',
                        'Life_expectancy + GDP_per_capita',
                        'Population', 'Life expectancy', 'GDP per capita'],
    'R_sqr_val': [0.7199793960178814,
                  0.7176933327761046, 0.719397247629431, 0.012482771328017228,
                  0.7134366397152028, 0.008836476822486672, 4.619449856890778e-05]
}

Lin_reg = pd.DataFrame(data)
Lin_reg.style.set_caption("Linear Regression for Emissions vs Predictors")
Out[45]:
Linear Regression for Emissions vs Predictors
  Model Num_predictors Predictor_names R_sqr_val
0 MLR 3 Life expectancy + GDP per capita + Population 0.719979
1 MLR 2 GDP_per_capita + Population 0.717693
2 MLR 2 Life_expectancy + Population 0.719397
3 MLR 2 Life_expectancy + GDP_per_capita 0.012483
4 SLR 1 Population 0.713437
5 SLR 1 Life expectancy 0.008836
6 SLR 1 GDP per capita 0.000046

Regression for Top 10 Emission Countries¶

When graphing the top 10 countries that produce the greatest amount of green house gases, we thought that maybe they produce a lot of weight into the results that are given in the analysis of the main dataset. To get a better understanding of how these smaller countries compare to the larger one, we thought to recreate the linear regression analysis on split datasets with the top 10 emitting countries against the others.

First, we make a list of the top 10 countries with the highest emission values using the mean of all the years, and use indexing to create a dataset of only the top 10 countries, and another dataset without them, and group them by year and country.

In [60]:
top_10_countries = merged_df.groupby('Country')['Emissions'].mean().nlargest(10).index
emissions_top_10 = merged_df[merged_df['Country'].isin(top_10_countries)]
emissions_without_top_10 = merged_df[~merged_df['Country'].isin(top_10_countries)]
emissions_top_10=emissions_top_10.groupby(['Country','Year']).sum().reset_index()
emissions_top_10.head()
Out[60]:
Country Year Emissions Life_expectancy Male Female GDP per capita Region sub-region Population Population_in_millions
0 Australia 2000 650.73 79.7 77.1 82.2 21679.24784 Oceania Australia and New Zealand 19153000.0 19.153000
1 Australia 2010 596.36 81.9 79.8 84.0 52022.12560 Oceania Australia and New Zealand 22031750.0 22.031750
2 Australia 2015 568.87 82.3 80.4 84.2 56755.72171 Oceania Australia and New Zealand 23815995.0 23.815995
3 Australia 2019 608.49 83.0 81.3 84.8 55060.32610 Oceania Australia and New Zealand 25364307.0 25.364307
4 Brazil 2000 1809.18 71.5 67.9 75.2 3749.75325 Americas Latin America and the Caribbean 174790340.0 174.790340
In [61]:
emissions_without_top_10.head()
Out[61]:
Country Emissions Year Life_expectancy Male Female GDP per capita Region sub-region Population Population_in_millions
163 South Africa 380.86 2000 55.8 53.1 58.6 3032.427138 Africa Sub-Saharan Africa 44967708.0 44.967708
166 South Africa 540.67 2010 57.1 55.0 59.2 7328.615629 Africa Sub-Saharan Africa 51216964.0 51.216964
169 South Africa 542.51 2015 62.6 59.8 65.3 5734.633629 Africa Sub-Saharan Africa 55386367.0 55.386367
172 South Africa 562.19 2019 65.3 62.2 68.3 6001.400814 Africa Sub-Saharan Africa 58558270.0 58.558270
179 Pakistan 240.86 2000 60.1 59.3 61.0 576.195601 Asia Southern Asia 142343578.0 142.343578

To put our theory into practice, we create visualizations of different comparisons and trends between the variables in our dataset. We compare the life expectancy against emissions to evaluate how greater emission rates affect the life expectancy of these higher emission countries versus the smaller, and include a linear trend to display the pattern.

In [65]:
#2 
emissions_top_10_graph = alt.Chart(emissions_top_10).mark_circle().encode(
    x=alt.X('Emissions:Q', scale = alt.Scale(type='sqrt', zero=False)),
    y='Life_expectancy:Q',
    color='Country:N',
    size=alt.Size('Population:Q', scale = alt.Scale(type = 'sqrt'))
).properties(width=400, height=300)

t10 = emissions_top_10_graph.transform_regression('Emissions', 'Life_expectancy').mark_line(color = 'black')


t10em = emissions_top_10_graph + t10

emissions_not_top_10_graph = alt.Chart(emissions_without_top_10).mark_circle().encode(
    x=alt.X('Emissions:Q', scale = alt.Scale(type='sqrt', zero=False)),
    y='Life_expectancy:Q',
    color='Country:N',
    size=alt.Size('Population:Q', scale = alt.Scale(type = 'sqrt'))
).properties(width=400, height=300)

nt10 = emissions_not_top_10_graph.transform_regression('Emissions', 'Life_expectancy').mark_line(color = 'black')


nt10em = emissions_not_top_10_graph + nt10
alt.vconcat(t10em, nt10em)
Out[65]:

Next, we depict the relationship between emissions against GDP per capita, plotting each point of each country, finding the mean GDP per capita over the years and observing how it affects the emissions output.

In [66]:
emissions_top_10_gdp_graph = alt.Chart(emissions_top_10).mark_circle().encode(
    x='GDP per capita:Q',
    y=alt.Y('Emissions:Q', scale = alt.Scale(type = 'log', zero=False)),
    color='Country:N',
    tooltip=['Country', 'Year', 'GDP per capita', 'Emissions', 'Population']
).properties(width=400, height=300)

bline = emissions_top_10_gdp_graph.transform_loess('GDP per capita', 'Emissions',
                             bandwidth = 0.45).mark_line(color = 'black')

emissions_not_top_10_gdp_graph = alt.Chart(emissions_without_top_10).mark_circle().encode(
    x='GDP per capita:Q',
    y=alt.Y('Emissions:Q', scale = alt.Scale(type = 'log', zero=False)),
    color='Country:N',
    tooltip=['Country', 'Year', 'GDP per capita', 'Emissions', 'Population']
).properties(width=400, height=300)

eline = emissions_not_top_10_gdp_graph.transform_loess('GDP per capita', 'Emissions',
                             bandwidth = 0.45).mark_line(color = 'black')


alt.vconcat(emissions_top_10_gdp_graph+bline, emissions_not_top_10_gdp_graph + eline)
Out[66]:

In addition, we plot emissions against population, using a logarithmic scale on the emissions to incorporate equally the smaller emissions from the smaller countries.

In [68]:
xtv = alt.Chart(emissions_top_10).mark_circle().encode(
    x='Population:Q',
    y=alt.Y('Emissions:Q', scale = alt.Scale(type = 'log', zero=False)),
    color='Country:N',
).properties(width=400, height=300)

ytv = alt.Chart(emissions_without_top_10).mark_circle().encode(
    x='Population:Q',
    y=alt.Y('Emissions:Q', scale = alt.Scale(type = 'log', zero=False)),
    color='Country:N',
).properties(width=400, height=300)

vline = xtv.transform_loess('Population', 'Emissions',
                             bandwidth = 0.45).mark_line(color = 'black')
vreg = ytv.transform_regression('Population', 'Emissions').mark_line(color = 'black')

yline = ytv.transform_loess('Population', 'Emissions',
                             bandwidth = 0.45).mark_line(color = 'black')

yreg = ytv.transform_regression('Population', 'Emissions').mark_line(color = 'black')

alt.vconcat(xtv+vreg, ytv+yreg)
Out[68]:

As these visualizations narrate the variations in how emissions are influenced in the top countries compared to the smaller ones, we proceed with another step in linear regression. Utilizing our segregated data, we exclude all non-numerical data, retaining only Emissions, Life Expectancy, Population, and GDP per capita. Subsequently, we conduct multiple linear regression (MLR) and simple linear regression (SLR) for both sets of high and low emission countries, presenting the results in tables at the conclusion, inclusive of our R^2 values.

In [69]:
reg_t10 = emissions_top_10.drop(columns=['Country','Year','Male', 'Female','Region','sub-region','Population_in_millions'])
reg_t10 = reg_t10.rename(columns = {'GDP per capita':'GDP_per_capita'})

reg_nt10 = emissions_without_top_10.drop(columns=['Country','Year','Male', 'Female','Region','sub-region','Population_in_millions'])
reg_nt10 = reg_nt10.rename(columns = {'GDP per capita':'GDP_per_capita'})
In [70]:
# Treat emissions as variable of interest
# MLR 1

# retrieve response
y = reg_t10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.loc[:, ['Life_expectancy', 'GDP_per_capita','Population']])

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[70]:
0.7413651601135549
In [71]:
#MLR 2

# retrieve response
y = reg_t10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.loc[:, ['Life_expectancy', 'Population']])

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[71]:
0.6878591169919026
In [53]:
#Treat emissions as variable of interest
#MLR 3

# retrieve response
y = reg_t10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.loc[:, ['Life_expectancy', 'GDP_per_capita']])

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[53]:
0.24329757166908683
In [54]:
#Treat emissions as variable of interest
#MLR 4

# retrieve response
y = reg_t10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.loc[:, ['GDP_per_capita','Population']])

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[54]:
0.6223115523941747
In [55]:
#Treat emissions as variable of interest
#SLR 1

# retrieve response
y = reg_t10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.Life_expectancy)

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[55]:
0.014584777637709712
In [56]:
#Treat emissions as variable of interest
#SLR 2

# retrieve response
y = reg_t10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.Population)

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[56]:
0.6140040826069113
In [57]:
#Treat emissions as variable of interest
#SLR 3

# retrieve response
y = reg_t10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.GDP_per_capita)

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[57]:
0.11989222540865618
In [58]:
Model = ['MLR', 'MLR','MLR','MLR','SLR','SLR','SLR']
Num_predictors = [3, 2, 2, 2, 1, 1, 1, 1]
Predictor_names = ['Life expectancy + GDP per capita + Population',
                   'GDP_per_capita + Population',
                   'Life_expectancy + Population',
                   'Life_expectancy + GDP_per_capita',
                   'Population','Life expectancy','GDP per capita']
R_sqr_val_t10 = [0.7413651601135549, 
             0.6223115523941747, 0.6878591169919026,0.24329757166908683,
            0.6140040826069113, 0.014584777637709712,0.11989222540865618 ]
                   
                   
data_t10= {
    'Model': ['MLR', 'MLR', 'MLR', 'MLR', 'SLR', 'SLR', 'SLR'],
    'Num_predictors': [3, 2, 2, 2, 1, 1, 1],
    'Predictor_names': ['Life expectancy + GDP per capita + Population',
                        'GDP_per_capita + Population',
                        'Life_expectancy + Population',
                        'Life_expectancy + GDP_per_capita',
                        'Population', 'Life expectancy', 'GDP per capita'],
    'R^2 of Top 10 Emission Countries': [0.7413651601135549, 
             0.6223115523941747, 0.6878591169919026,0.24329757166908683,
            0.6140040826069113, 0.014584777637709712,0.11989222540865618 ]
}

Regression for Remaining Emission Countries¶

The procedure for this is identical to the one outlined above.

In [59]:
#Treat emissions as variable of interest
#MLR 1

# retrieve response
y = reg_nt10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.loc[:, ['Life_expectancy', 'GDP_per_capita','Population']])

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[59]:
0.4978334100816997
In [60]:
#Treat emissions as variable of interest
#MLR 2

# retrieve response
y = reg_nt10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.loc[:, ['Life_expectancy','Population']])

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[60]:
0.4943751865686107
In [61]:
#Treat emissions as variable of interest
#MLR 3

# retrieve response
y = reg_nt10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.loc[:, ['Life_expectancy', 'GDP_per_capita']])

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[61]:
0.01932313749396042
In [62]:
#Treat emissions as variable of interest
#MLR 4

# retrieve response
y = reg_nt10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.loc[:, ['GDP_per_capita','Population']])

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[62]:
0.48446928491764774
In [63]:
#Treat emissions as variable of interest
#SLR 1

# retrieve response
y = reg_nt10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.Life_expectancy)

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[63]:
0.018829484767021176
In [64]:
#Treat emissions as variable of interest
#SLR 2

# retrieve response
y = reg_nt10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.Population)

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[64]:
0.4614267534668359
In [65]:
#Treat emissions as variable of interest
#SLR 3

# retrieve response
y = reg_nt10.Emissions

# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.GDP_per_capita)

# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)

# store results
rslt = mlr.fit()
rslt.rsquared
Out[65]:
0.0036404170398194324
In [66]:
Model = ['MLR', 'MLR','MLR','MLR','SLR','SLR','SLR']
Num_predictors = [3, 2, 2, 2, 1, 1, 1, 1]
Predictor_names = ['Life expectancy + GDP per capita + Population',
                   'GDP_per_capita + Population',
                   'Life_expectancy + Population',
                   'Life_expectancy + GDP_per_capita',
                   'Population','Life expectancy','GDP per capita']
R_sqr_val_nt10 = [0.49783341008169957, 
             0.4844692849176476, 0.4943751865686107,0.019323137493960307,
            0.4614267534668358, 0.018829484767021065,0.0036404170398192104]
                   
                   
data_nt10= {
    'Model': ['MLR', 'MLR', 'MLR', 'MLR', 'SLR', 'SLR', 'SLR'],
    'Num_predictors': [3, 2, 2, 2, 1, 1, 1],
    'Predictor_names': ['Life expectancy + GDP per capita + Population',
                        'GDP_per_capita + Population',
                        'Life_expectancy + Population',
                        'Life_expectancy + GDP_per_capita',
                        'Population', 'Life expectancy', 'GDP per capita'],
    'R^2 of Other Countries': [0.49783341008169957, 
             0.4844692849176476, 0.4943751865686107,0.019323137493960307,
            0.4614267534668358, 0.018829484767021065,0.0036404170398192104]
}
In [67]:
Lin_reg_top_10_emission = pd.DataFrame(data_t10)  
Lin_reg_top_10_emission.style.set_caption("Linear Regression for Emissions vs Predictors of Top 10 Emission Countries")  
Out[67]:
Linear Regression for Emissions vs Predictors of Top 10 Emission Countries
  Model Num_predictors Predictor_names R^2 of Top 10 Emission Countries
0 MLR 3 Life expectancy + GDP per capita + Population 0.741365
1 MLR 2 GDP_per_capita + Population 0.622312
2 MLR 2 Life_expectancy + Population 0.687859
3 MLR 2 Life_expectancy + GDP_per_capita 0.243298
4 SLR 1 Population 0.614004
5 SLR 1 Life expectancy 0.014585
6 SLR 1 GDP per capita 0.119892
In [68]:
Lin_reg_not_top_10_emission = pd.DataFrame(data_nt10)        
Lin_reg_not_top_10_emission.style.set_caption("Linear Regression for Emissions vs Predictors of Other Countries")  
Out[68]:
Linear Regression for Emissions vs Predictors of Other Countries
  Model Num_predictors Predictor_names R^2 of Other Countries
0 MLR 3 Life expectancy + GDP per capita + Population 0.497833
1 MLR 2 GDP_per_capita + Population 0.484469
2 MLR 2 Life_expectancy + Population 0.494375
3 MLR 2 Life_expectancy + GDP_per_capita 0.019323
4 SLR 1 Population 0.461427
5 SLR 1 Life expectancy 0.018829
6 SLR 1 GDP per capita 0.003640

Summary of Findings¶

We delve into the relationships between the variables in question in our datasets, emissions, and life to determine how one might affect the other. Our initial observations reveal a notable surge in greenhouse gas emissions globally, with the most significant increase occurring in the early 2000s. Despite attempts to identify geographic patterns, emissions do not exhibit an apparent spillover across national borders, except for discernible disparities in emission levels among regions.

When we checked the emissions data, it was clear that some countries had much higher emission values than the rest. To understand this better, we took a closer look at the top 10 countries with the highest emissions and grouped the others into a category we called "Other." Plotting the yearly emissions after this grouping showed us how much impact the top emitters had compared to the remaining 185 countries. The resulting chart made us question whether the overall emissions results represented the smaller countries accurately. Later on, we will specifically analyze this difference via our regression analyses.

Turning to the life dataset, we observed a higher population count in the Asian region, totaling 1.55 trillion people and constituting 65% of the dataset's overall population. Consequently, it is reasonable to anticipate that findings from the Asian region will significantly influence the dataset's overall outcomes. The average life expectancy of 72.45 years showed a proportional increase of 0.27 years over the 20-year dataset period. A gender-based analysis revealed an intriguing trend where male life expectancy rose quicker than females, with an increase of 5.8 years for males versus 5.6 years for females. Notably, the percentage increase in life expectancy tends to diminish over the years, declining from a 2.37% rise between 2010 and 2015 to a 1.46% increase from 2015 to 2019. This decline may be associated with the increasing levels of greenhouse gas emissions. In our investigation, we explored external online sources referenced in the paragraph. Subsequent research highlighted that fossil fuels such as coal, oil, and gas significantly contribute to global climate change, accounting for over 75 percent of global greenhouse gas emissions and nearly 90 percent of all carbon dioxide emissions. Understanding the origin of these emissions and how fossil fuels enter the atmosphere is crucial. Information from www.un.org indicates that critical contributors to greenhouse gas emissions include power generation, manufacturing goods, deforestation, and transportation – all indicative of a thriving economy. Consequently, we undertook a comparative analysis, juxtaposing Gross Domestic Product (GDP) with life expectancy and examining GDP trends across various regions. This exploration revealed a positive correlation, demonstrating an increase in GDP corresponding to elevated life expectancy in each region.

Combining the life and emissions datasets facilitated a direct comparison between life expectancy and emissions, visualized on a shared graph. After organizing and refining the data, a distinct positive correlation emerged, indicating that higher life expectancy among a country's citizens tends to align with increased greenhouse gas emissions. This correlation is further associated with the observed rise in life expectancy concurrent with increased GDP per capita. These findings suggest a connection between emissions growth, GDP expansion, and prolonged life expectancy. A heatmap illustrating the numerical relationships within our dataset was employed to represent these correlations visually.

We claimed that these factors contribute to emission increases, so we conducted several linear regression models to understand which combinations of variables influence emissions. Following these analyses, the multiple linear regression incorporating all three numeric variables—population, life expectancy, and GDP—yielded the highest R^2 value, indicating that the model explains approximately 80% of the data points through the regression line. In simple linear regression and individual variable analysis, population emerged as the most significant predictor when forecasting emissions.

As mentioned earlier, we highlighted that the top 10 emitting countries significantly influence overall trends in emission production, life expectancy, and population growth. When analyzing these countries separately from the least emitting ones, distinct patterns emerge. Firstly, regarding the correlation between emissions and life expectancy, heavy-emitting countries exhibit a negative trend in life expectancy. In contrast, low-emitting countries show a positive correlation—life expectancy increases with rising emissions. Similarly, when comparing GDP to emissions, high-emitting countries demonstrate a negative trend as GDP increases, which is noteworthy. Conversely, low-emitting countries display a somewhat different trend, where very low GDP initially leads to increased emissions, then trends negatively as their total GDP approaches the mean. Lastly, comparing population and emissions reveals a sharp increase in emissions with population growth in high-emitting countries, while low-emitting countries experience a more gradual rise in emissions.

With this new information, we conducted linear regression on the divided dataset for high and low-emitting countries to identify key factors. In the case of high emitting countries, after running various regressions, the multiple linear regression, encompassing all three numeric variables (population, life expectancy, and GDP), produced the highest R^2 value, indicating that the model explains approximately 74% of the data points along the regression line. Turning to low-emitting countries, both models—utilizing three predictors and two predictors (life expectancy and population)—exhibited similar explanatory variability, ranging between 49% and 50%. When considering simple linear regression and evaluating each variable independently, population emerged as the most significant predictor for emissions in low and high-emitting countries.

In summary, our thorough analysis has provided a conclusive response to our question of interest concerning emissions and life expectancy. Despite variations in magnitude across different regions, we observed a consistent positive correlation between life expectancy and emissions. This alignment is not surprising, considering the influence of additional factors in our datasets. Population value also positively correlates with emissions, interconnected with GDP per capita. Our linear regression underscores the importance of considering all numerical variables for accurate emission predictions, emphasizing the complexity of relying solely on a single factor. It intuitively makes sense that emissions and GDP are positively correlated, as economic activities like manufacturing and deforestation contribute to both. Moreover, a higher population in a region leads to increased transportation and greenhouse gas emissions, and a rising GDP indicates improved healthcare practices, contributing to an extended life expectancy in a flourishing economy.